For the past few years I’ve been a statistics tutor in the psychology department at Maynooth University. We cover everything from the basics of descriptive statistics and graphs to the nitty-gritty of t-tests, correlations, and ANOVAs. However, there’s one topic where I always slow down and dive deep: regression analysis. Why? Because linear regression is basically the Swiss Army knife of statistics—it’s everywhere, it’s versatile, and it’s way cooler than it sounds.
Seriously, linear regression is like that one friend who shows up at every party. Lectures? Check. Assignments? Obviously. Research papers? You bet. Even casual chats about data somehow always circle back to it. It’s the ultimate tool for understanding how variables play together. Whether you’re predicting exam scores based on study hours (or lack thereof) or calculating how much coffee it takes to survive finals week (spoiler: a lot), regression has your back.
One thing I always like doing is having students calculate regression coefficients by hand. It’s a great way to understand the mechanics of the model and appreciate the magic of least squares. But, full disclosure: I always forget the formulas mid-session and end up frantically Googling them. So, I decided to write them down here—partly for fun, partly as a future reference for myself (hey future me!!).
The linear regression model
At its core, linear regression is all about finding relationships between variables. The model is a simple equation that describes how a dependent variable (Y) relates to one or more independent variables (X). In its simplest form, it looks like this:
Y = \beta_0 + \beta_1X + \epsilon
where:
Y the dependent variable (the thing you’re trying to predict).
X the independent variable (the thing you’re using to make predictions).
\beta_0 is the intercept (where the line crosses the Y-axis).
\beta_1 is the slope (how much Y changes for every unit change in X).
\epsilon the error term (because let’s face it, life is messy).
To make this more concrete, let’s create a small dataset in R. We’ll simulate some data for study hours and exam scores, and then use linear regression to model the relationship between them.
Show the code
set.seed(123)# Hours of study per week (our x variable)study_hours<-rnorm(10, mean =7, sd =2)# Exam scores (our y variable)exam_scores<-25+5*study_hours+rnorm(10, mean =0, sd =2)test_data<-data.frame( study_hours =ceiling(study_hours), exam_scores =ceiling(exam_scores))head(test_data)
If we plot this data we can see the relationship between study hours and exam scores:
Show the code
exam_plot<-test_data|>ggplot(aes(x =study_hours, y =exam_scores))+geom_jitter(size =3, width =0.1, height =0.1)+labs( title ="Exam scores vs. study hours", x ="Study hours", y ="Exam scores")exam_plot
Figure 1: Scatter plot of exam scores vs. study hours. Each point represents a student’s exam score based on the number of study hours per week.
From the plot, you can see a clear trend: the more hours a student spends studying, the higher their exam score tends to be. This is the relationship we want to model using linear regression.
Fitting a model in R
In R, fitting a linear regression model is as simple as using the lm() function. We provide the formula for the model and the dataset, and R does the rest. Let’s fit a linear model to our data:
Show the code
fit<-lm(exam_scores~study_hours, data =test_data)broom::tidy(fit)
The lm() function fits a linear model where exam_scores is the dependent variable (Y) and study_hours is the independent variable (X). The output gives us the estimated coefficients for the intercept (\beta_0) and slope (\beta_1).
The intercept (\beta_0) is 21.33, meaning a student who studies 0 hours is predicted to score approximately 21% on the exam.
The slope (\beta_1) is 5.23, indicating that for every additional hour of study, a student’s predicted exam score increases by 5.23%.
Visually this looks like
Show the code
exam_plot<-exam_plot+geom_smooth( method ="lm", colour =ggokabeito::palette_okabe_ito(order =5), se =FALSE)+# Adding equation of the lineannotate("text", x =6, y =80, label =paste0("Y = ", round(coef(fit)[1], 2), " + ", round(coef(fit)[2], 2), "X"), color =ggokabeito::palette_okabe_ito(order =5), size =rel(5), fontface ="bold")exam_plot
Figure 2: Scatter plot of exam scores vs. study hours with a linear regression line. The equation of the line is displayed on the plot.
But how are these values calculated
Calculating Regression Coefficients by Hand
While R’s lm() function makes fitting a linear regression model incredibly easy, it’s helpful to understand how the coefficients (\beta_0 \: \text{and} \: \beta_1) are derived.
The goal of linear regression is to find the best-fitting line through the data, which is done by minimizing the sum of squared errors. The sum of squared errors is calculated as:
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2
where:
Y_i is the actual value of the dependent variable.
\hat{Y_i} is the predicted value of the dependent variable.
n is the number of observations.
Show the code
# Add predicted values to the datasettest_data<-test_data|>dplyr::mutate(predicted =coef(fit)[1]+coef(fit)[2]*study_hours)# Visualizing SSEexam_plot+geom_segment( data =test_data,aes(x =study_hours, xend =study_hours, y =exam_scores, yend =predicted), linetype ="dashed", colour =ggokabeito::palette_okabe_ito(order =1), linewidth =1)b_1<-188.8/36.161.6-b_1*7.7
[1] 21.32964
Figure 3: Visual representation of the sum of squared errors (SSE) in linear regression. The goal is to minimize the vertical distance between the data points and the regression line.
To minimize the SSE, we need to find the values of \beta_0 (intercept) and \beta_1 (slope) that make the line fit the data as closely as possible. The formulas for \beta_0 and \beta_1 are:
---title: "Doing linear regression by hand"description: "A step-by-step guide to calculating regression coefficients by hand"date: "2025-03-12"format: html: html-math-method: katex shift-heading-level-by: 1 lightbox: truecitation: url: "https://c-monaghan.github.io/posts/2025/"execute: warning: falseeditor: source---For the past few years I've been a statistics tutor in the psychology department at Maynooth University. We cover everything from the basics of descriptive statistics and graphs to the nitty-gritty of t-tests, correlations, and ANOVAs. However, there's one topic where I always slow down and dive deep: regression analysis. Why? Because linear regression is basically the Swiss Army knife of statistics—it’s everywhere, it’s versatile, and it’s way cooler than it sounds.Seriously, linear regression is like that one friend who shows up at every party. Lectures? Check. Assignments? Obviously. Research papers? You bet. Even casual chats about data somehow always circle back to it. It’s the ultimate tool for understanding how variables play together. Whether you’re predicting exam scores based on study hours (or lack thereof) or calculating how much coffee it takes to survive finals week (spoiler: *a lot*), regression has your back.One thing I always like doing is having students calculate regression coefficients by hand. It's a great way to understand the mechanics of the model and appreciate the magic of least squares. But, full disclosure: I *always* forget the formulas mid-session and end up frantically Googling them. So, I decided to write them down here—partly for fun, partly as a future reference for myself ([hey future me!!](https://media.tenor.com/0CpFOKGVaeMAAAAi/hand-waving-hand.gif])).# The linear regression modelAt its core, linear regression is all about finding relationships between variables. The model is a simple equation that describes how a dependent variable $(Y)$ relates to one or more independent variables $(X)$. In its simplest form, it looks like this:$$ Y = \beta_0 + \beta_1X + \epsilon $$where: - $Y$ the dependent variable (the thing you’re trying to predict).- $X$ the independent variable (the thing you’re using to make predictions).- $\beta_0$ is the intercept (where the line crosses the Y-axis).- $\beta_1$ is the slope (how much $Y$ changes for every unit change in $X$).- $\epsilon$ the error term (because let’s face it, life is messy).## Generating some dataFirst let's load some packages and set our theme```{r setting up}# Loading packageslibrary(ggplot2)# Setting default themetheme_set( theme_minimal() + theme( plot.title = element_text(hjust = 0.5, face = "bold", size = rel(1.4)), axis.title = element_text(size = rel(1)) ))```To make this more concrete, let’s create a small dataset in R. We’ll simulate some data for study hours and exam scores, and then use linear regression to model the relationship between them.```{r generating_data}set.seed(123)# Hours of study per week (our x variable)study_hours <- rnorm(10, mean = 7, sd = 2)# Exam scores (our y variable)exam_scores <- 25 + 5 * study_hours + rnorm(10, mean = 0, sd = 2)test_data <- data.frame( study_hours = ceiling(study_hours), exam_scores = ceiling(exam_scores))head(test_data)```If we plot this data we can see the relationship between study hours and exam scores:```{r plot_data}#| fig-width: 12#| label: fig-exam-scores#| fig-cap: Scatter plot of exam scores vs. study hours. Each point represents a student's exam score based on the number of study hours per week.exam_plot <- test_data |>ggplot(aes(x = study_hours, y = exam_scores)) + geom_jitter(size = 3, width = 0.1, height = 0.1) + labs( title = "Exam scores vs. study hours", x = "Study hours", y = "Exam scores" )exam_plot```From the plot, you can see a clear trend: **the more hours a student spends studying, the higher their exam score tends to be**. This is the relationship we want to model using linear regression.## Fitting a model in RIn R, fitting a linear regression model is as simple as using the `lm()` function. We provide the formula for the model and the dataset, and R does the rest. Let's fit a linear model to our data:```{r fitting_model}fit <- lm(exam_scores ~ study_hours, data = test_data)broom::tidy(fit)```The `lm()` function fits a linear model where `exam_scores` is the dependent variable $(Y)$ and `study_hours` is the independent variable $(X)$. The output gives us the estimated coefficients for the intercept $(\beta_0)$ and slope $(\beta_1)$.- The intercept $(\beta_0)$ is **21.33**, meaning a student who studies 0 hours is predicted to score approximately 21% on the exam.- The slope $(\beta_1)$ is **5.23**, indicating that for every additional hour of study, a student's predicted exam score increases by 5.23%.Visually this looks like```{r visualise-model}#| fig-width: 12#| label: fig-exam-scores-regression#| fig-cap: Scatter plot of exam scores vs. study hours with a linear regression line. The equation of the line is displayed on the plot.exam_plot <- exam_plot + geom_smooth( method = "lm", colour = ggokabeito::palette_okabe_ito(order = 5), se = FALSE) + # Adding equation of the line annotate( "text", x = 6, y = 80, label = paste0( "Y = ", round(coef(fit)[1], 2), " + ", round(coef(fit)[2], 2), "X"), color = ggokabeito::palette_okabe_ito(order = 5), size = rel(5), fontface = "bold" )exam_plot```**But how are these values calculated**# Calculating Regression Coefficients by HandWhile R’s `lm()` function makes fitting a linear regression model incredibly easy, it’s helpful to understand how the coefficients $(\beta_0 \: \text{and} \: \beta_1)$ are derived.The goal of linear regression is to find the best-fitting line through the data, which is done by minimizing the sum of squared errors. The sum of squared errors is calculated as: $$ SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 $$where: - $Y_i$ is the actual value of the dependent variable.- $\hat{Y_i}$ is the predicted value of the dependent variable.- $n$ is the number of observations.```{r visualise-SSE}#| fig-width: 12#| label: fig-SSE#| fig-cap: Visual representation of the sum of squared errors (SSE) in linear regression. The goal is to minimize the vertical distance between the data points and the regression line.# Add predicted values to the datasettest_data <- test_data |> dplyr::mutate(predicted = coef(fit)[1] + coef(fit)[2] * study_hours)# Visualizing SSEexam_plot + geom_segment( data = test_data, aes(x = study_hours, xend = study_hours, y = exam_scores, yend = predicted), linetype = "dashed", colour = ggokabeito::palette_okabe_ito(order = 1), linewidth = 1)b_1 <- 188.8 / 36.161.6 - b_1 * 7.7```To minimize the SSE, we need to find the values of $\beta_0$ (intercept) and $\beta_1$ (slope) that make the line fit the data as closely as possible. The formulas for $\beta_0$ and $\beta_1$ are:\begin{align*}\beta_1 &= \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - X)^2} \\[12pt]\beta_0 &= \bar{Y} - \beta_1\bar{X}\end{align*}where:- $X_i$ and $Y_i$ are the individual data points.- $\bar{X}$ and $\bar{Y}$ are the means of the independent and dependent variables, respectively.## Calculating the slope $(\beta_1)$Let’s calculate $\beta_1$ step by step using the following data:| Student | Study Hours (X) | Exam Scores (Y) ||:-------:|:---------------:|:---------------:|| 1 | 6 | 57 || 2 | 7 | 59 || 3 | 11 | 77 || 4 | 8 | 61 || 5 | 8 | 61 || 6 | 11 | 81 || 7 | 8 | 66 || 8 | 5 | 44 || 9 | 6 | 55 || 10 | 7 | 55 |We need to compute the following:- Mean of $X \: (\bar{X})$ and mean of $Y \: (\bar{Y})$.- Deviations from the mean: $(X_i - \bar{X})$ and $(Y_i - \bar{Y})$.- Products of deviations: $(X_i - \bar{X})(Y_i - \bar{Y})$.- Squared deviations of $X$: $(X_i - \bar{X})^2$.| Student | $X$ | $Y$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ ||:-------:|:---:|:----:|:--------------------------------:|:-------------------:|| 1 | 6 | 57 | 7.82 | 2.89 || 2 | 7 | 59 | 1.82 | 0.49 || 3 | 11 | 77 | 50.82 | 10.89 || 4 | 8 | 61 | -0.18 | 0.09 || 5 | 8 | 61 | -0.18 | 0.09 || 6 | 11 | 81 | 64.02 | 10.89 || 7 | 8 | 66 | 1.32 | 0.09 || 8 | 5 | 44 | 47.52 | 7.29 || 9 | 6 | 55 | 11.22 | 2.89 || 10 | 7 | 55 | 4.62 | 0.49 || $\sum$ | 77 | 616 | 188.8 | 36.1 || Mean | 7.7 | 61.6 | | |Using the formula we now get:$$\beta_1 = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - X)^2} = \frac{188.8}{36.1} \approx 5.23$$## Calculating the intercept $(\beta_0)$Using the formula we now get:$$\beta_0 = \bar{Y} - \beta_1\bar{X} = 61.6 - 5.23(7.7) \approx 21.32$$Hence our regression equation is $\hat{Y} = 21.3 + 5.23X$ which is the same as our `lm()` model in R.